A DISSIMILARITY MEASURE FOR CLUSTERING HIGH- AND INFINITE DIMENSIONAL 
DATA THAT SATISFIES THE TRIANGLE INEQUALITY 

EDUARDO A. SOCOLOVSKY 1 


Abstract. The cosine or correlation measures of similarity used to cluster high dimensional data 
are interpreted as projections, and the orthogonal components are used to define a complementary 
dissimilarity measure to form a similarity-dissimilarity measure pair. Using a geometrical approach, a 
number of properties of this pair are established. This approach is also extended to general inner-product 
spaces of any dimension. These properties include the triangle inequality for the defined dissimilarity 
measure, error estimates for the triangle inequality and bounds on both measures that can be obtained 
with a few floating-point operations from previously computed values of the measures. The bounds and 
error estimates for the similarity and dissimilarity measures can be used to reduce the computational 
complexity of clustering algorithms and enhance their scalability, and the triangle inequality allows the 
design of clustering algorithms for high dimensional distributed data. 

Subject classification. Applied and Numerical Math 

Key words. similarity measures, clustering, high dimensional data, distributed knowledge 
discovery, scalable data mining 

1. Introduction. Clustering is a data analysis technique in which a measure of similarity, or 
equivalently a measure of dissimilarity, is used to detect groups or patterns in data. Traditionally, these 
similarity and dissimilarity measures have been related linearly [11]. Clustering of multidimensional data 
is one of the main tools in Knowledge Discovery from Data (KDD), a field that emerged from the need to 
extract useful information from the vast amount of data generated by simulations or measurements. 
Clustering is an essential step in data mining, statistical data analysis, pattern recognition, image 
processing, and can be used to drive data layout in massive distributed datasets, for example, to improve 
the retrieval of data subsets from tertiary systems or minimize the amount of data transferred and stored. 

The most often used measure of similarity is the Euclidean distance between the vectors 
representing the data features. This is adequate for low dimensional data; for high dimensional data, 
however, it is well known that the Euclidean distance does not work well. Clustering high-dimensional 
data in pattern recognition and in text and scientific data mining continues to attract a significant 
amount of attention and effort, since algorithms have to overcome the “dimensionality curse” [8] and 
simultaneously be scalable and computationally efficient. It has been determined that for 
high-dimensional data more adequate measures of similarity are the cosine or Pearson’s correlation 
measure, e.g., see [1-7, 11-13, 16, 17, 21]. 

The cosine (correlation) similarity measure is the dot-product U*V, where U and V are two unit 
length (zero mean) vectors representing data features. An important problem with these and other 
similarity measures in high dimensions is that the triangle inequality does not hold [4, 5]. To illustrate, 
consider the points U = (1,1,1,1,1,0,0,0,0,0)/√5, V = (1,0,1,0,1,0,1,0,1,0)/√5 and 
W = (0,0,0,0,0,1,1,1,1,1)/√5; then U*V = 3/5, U*W = 0 and W*V = 2/5, which shows that 
U*V ≤ U*W + W*V does not hold. 

In this paper, to build upon the suitability of the cosine (correlation) similarity measure for high 
dimensional data, a non-linearly associated similarity-dissimilarity measure pair is obtained by 
interpreting the cosine (correlation) similarity measure as the projection of a data point, and defining the 
associated dissimilarity measure d(V,U) to be the length of the orthogonal component. 
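
The counterexample can be checked numerically. The sketch below is an illustration and not part of the original derivation: it assumes ten-component 0/1 vectors scaled by 1/√5, chosen so that the dot products are 3/5, 0 and 2/5, and it uses the hypothetical helper names `dot` and `dissim`. For unit vectors, `dissim` returns the length of the orthogonal component, sqrt(1 - (x*y)^2).

```python
import math

def dot(x, y):
    """Euclidean dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def dissim(x, y):
    """Length of the component of x orthogonal to y (both unit vectors)."""
    return math.sqrt(max(0.0, 1.0 - dot(x, y) ** 2))

# Assumed example vectors: 0/1 patterns scaled to unit length.
s = 1.0 / math.sqrt(5.0)
U = [s * c for c in (1, 1, 1, 1, 1, 0, 0, 0, 0, 0)]
V = [s * c for c in (1, 0, 1, 0, 1, 0, 1, 0, 1, 0)]
W = [s * c for c in (0, 0, 0, 0, 0, 1, 1, 1, 1, 1)]

uv, uw, wv = dot(U, V), dot(U, W), dot(W, V)

# The similarity fails a triangle-type inequality: 3/5 > 0 + 2/5.
similarity_violates = uv > uw + wv

# The orthogonal-component dissimilarity passes it for the same three points.
dissimilarity_holds = dissim(V, W) <= dissim(V, U) + dissim(U, W)
```

For these three points the dissimilarities are d(V,U) = 4/5, d(U,W) = 1 and d(V,W) = √21/5, so the triangle inequality for d holds with room to spare.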

It has also been observed that a significant portion of the presently used data analysis techniques 
become infeasible for data sets of very modest size [19, 20]; hence it is important to produce new 
algorithms, approaches and tools that help extend the limits of computational feasibility and reduce the 
cost of performing data mining. The key factors are computational efficiency and scalability of the 
algorithms, as well as the scalability of the implementation. The results presented in this paper are a 
continuation of [14], and it is expected that they can help enhance the scalability and computational 
efficiency of algorithms that require a similarity matrix, or multi-step and pre-clustering methods. For 
instance, algorithms with a canopy or k-means approach could be re-designed to compute (at least in the 
average case) only O(N) inner products to generate approximate similarity and dissimilarity matrices 
or equivalent bounds, instead of the standard O(N^2) inner products. 

Distributed hierarchical algorithms for clustering distributed data construct an approximate global 
dendrogram from the local dendrograms of the distributed data sets. Generally, they rely on the Euclidean 
distance, using bounds and the triangle inequality. A new hierarchical algorithm for heterogeneously 
distributed data sets containing data of high dimensions has been designed using a new measure of 
dissimilarity for distributed data based on the measures studied in this paper. Work is in progress on its 
implementation and on an algorithm for homogeneously distributed data. The algorithms and their results 
will be reported in a forthcoming paper. 

To the best of our knowledge, infinite dimensional clustering is presently not used, but it can 
potentially be used to cluster results from simulations or observations of phenomena modeled by PDEs 
whose solutions belong to an inner product space, e.g., the standard Sobolev spaces. In this case, the “data 
points” could for example be whole-domain finite element simulations or observations at a fixed time, 
carrying information on the solution and its derivatives. 

2. Similarity-Dissimilarity Measures Properties. The dissimilarity measure d(V,U) between 
any two normalized vectors U and V is defined as their “orthogonal distance”, i.e., the length of the 
component of V orthogonal to U. It will be shown that the dissimilarity measure satisfies the symmetry 
property and the triangle inequality, which yield the standard bound on differences used in metric spaces. 
As a result of these definitions and properties, bounds on the measures V*W and d(V,W) between any 
two normalized vectors V and W are obtained in terms of already computed measures d(V,U), d(W,U), 
U*V and U*W, as shown in the following paragraphs. Specifically, since 

     V = (U*V) U + (I - UU*)V 

we define the dissimilarity measure d(V,U) by 

     d(V,U) = || H(I - UU*)V || 

and, interchanging the roles of U and V, 

     d(U,V) = || H(I - VV*)U || 

where H is any orthogonal transformation. In this paper H = I, the identity, is the preferred choice, but for 
algorithms H could for example be a Householder or Givens transformation. 

Property 2.1. d(V,U) = d(U,V) 

Proof. Using that V*V = U*U = 1 and the fact that (I - UU*) and (I - VV*) are symmetric 
projections, squaring the dissimilarity measures we obtain: 

     d^2(V,U) = [H(I - UU*)V]* [H(I - UU*)V] = V*(I - UU*)V = 1 - (U*V)^2 

     d^2(U,V) = [H(I - VV*)U]* [H(I - VV*)U] = U*(I - VV*)U = 1 - (U*V)^2. ■ 
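
As a numerical illustration of the definition and of Property 2.1, the following minimal sketch evaluates d with H = I and with H a Householder reflection, and compares all values with the closed form sqrt(1 - (U*V)^2). The function names, the random test vectors and the particular reflection direction are assumptions made for the example only.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def unit(x):
    n = norm(x)
    return [a / n for a in x]

def orth_component(v, u):
    """(I - u u*) v: the component of v orthogonal to the unit vector u."""
    c = dot(u, v)
    return [a - c * b for a, b in zip(v, u)]

def householder(w, x):
    """Apply the reflection H = I - 2 w w* (w a unit vector) to x."""
    c = 2.0 * dot(w, x)
    return [a - c * b for a, b in zip(x, w)]

def d(v, u, w=None):
    """d(v,u) = ||H (I - u u*) v||, with H = I when w is None."""
    r = orth_component(v, u)
    if w is not None:
        r = householder(w, r)   # an orthogonal H leaves the length unchanged
    return norm(r)

rng = random.Random(0)          # fixed seed, illustrative data only
n = 50
U, V, Wh = (unit([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(3))

closed_form = math.sqrt(1.0 - dot(U, V) ** 2)
values = [d(V, U), d(U, V), d(V, U, Wh), d(U, V, Wh)]
max_gap = max(abs(x - closed_form) for x in values)
```

All four values agree with the closed form up to rounding, which reflects both the symmetry of d and its independence from the choice of orthogonal H.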

Property 2.2. Given arbitrary unit vectors U, V and W 

     d(V,W) ≤ d(V,U) + d(U,W) 

Proof. Let P be the projection of V in the direction of U 

(1) P = (U*V)U 

then V - P = V - (U* V)U = (I - UU*) V is the component of V orthogonal to U, consequently 

(2) d(V,U) = || V - P || 

Similarly, if S = (W*U)W is the projection of U in the direction of W, then U - S = (I - WW*) U is 
the component of U orthogonal to W, and 

(3) d(U,W) = || U - S || 

Now, let Q = (W*P)W be the projection of P in the direction of W, then from (1) and (3) 

(4) || P - Q || = || (I - WW*)P || = |U*V| || (I - WW*)U || ≤ || U - S || = d(U,W) 

Also, let R = (W*V)W be the projection of V in the direction of W and V - R = (I - WW*)V be the 
component of V orthogonal to W. Since R is the point of span(W) closest to V and Q lies in span(W), 
|| V - R || ≤ || V - Q ||, and d(V,W) = || V - R ||, which yields 

(5) d(V,W) ≤ || V - Q || 

Finally, from (5), (2) and (4) 

     d(V,W) ≤ || V - P || + || P - Q || ≤ d(V,U) + d(U,W) ■ 
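
The triangle inequality of Property 2.2 can be spot-checked on random unit vectors. The sketch below is only a sanity check under assumed test data (Gaussian vectors in a few dimensions, fixed seed); it computes d directly as the length of the orthogonal component and records the smallest slack d(V,U) + d(U,W) - d(V,W) observed.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def unit(x):
    n = math.sqrt(dot(x, x))
    return [a / n for a in x]

def d(v, u):
    """Length of the component of v orthogonal to the unit vector u."""
    c = dot(u, v)
    return math.sqrt(sum((a - c * b) ** 2 for a, b in zip(v, u)))

rng = random.Random(1)                     # illustrative seed
worst_slack = float("inf")
for _ in range(2000):
    n = rng.choice((3, 10, 100))           # assumed test dimensions
    U, V, W = (unit([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(3))
    slack = d(V, U) + d(U, W) - d(V, W)    # nonnegative by Property 2.2
    worst_slack = min(worst_slack, slack)

# Antipodal vectors have dissimilarity zero, so d is not a distance.
antipodal = d(U, [-a for a in U])
```

A nonnegative `worst_slack` (up to rounding) is consistent with the property; it is of course not a proof.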

Notice that the dissimilarity measure is not a distance, since d(V,U) = 0 when either U = V or U = -V. 
Direct consequences of the definition of d(V,U) and Properties 2.1 and 2.2 are: 

Property 2.3. U*V ≥ δ if and only if d^2(V,U) ≤ 1 - δ^2 and U*V ≥ 0 ■ 

Property 2.4. d(V,W) ≥ | d(V,U) - d(W,U) | ■ 

From Properties 2.3 and 2.4 immediately follows 

Property 2.5. If d(V,U) and d(W,U) have been previously computed and 
| d(V,U) - d(W,U) | > √(1 - δ^2), then W*V < δ. ■ 


If the basic clustering criterion is summarized by: “W and V are in the same cluster if and only if W*V 
≥ δ, with δ near 1”, Property 2.3 yields an equivalent statement in terms of d(W,V), and Property 2.5 
shows that if the difference of the dissimilarities d(V,U) and d(W,U) is large enough, then the vectors V 
and W cannot be in the same cluster. Next, two equivalent bounds on W*V are given in terms of 
previously computed similarity and dissimilarity measures. 

Property 2.6. V*W ≤ (U*V)(U*W) + d(V,U) d(W,U) 

Proof. From the orthogonal decompositions V = P + (V - P) and W = T + (W - T), where P is given by (1) 
and T = (U*W)U, and the Cauchy-Schwarz inequality, we have 

     V*W ≤ (U*V)(U*W) + || V - P || || W - T || 

and the definition of the dissimilarity measure yields the result. ■ 
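
The bound of Property 2.6 uses only four scalars that refer to the common vector U. The following sketch, with hypothetical helper names and randomly generated test vectors, evaluates the right side from those scalars and compares it with the directly computed V*W.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def unit(x):
    n = math.sqrt(dot(x, x))
    return [a / n for a in x]

def d(v, u):
    """Length of the component of v orthogonal to the unit vector u."""
    c = dot(u, v)
    return math.sqrt(sum((a - c * b) ** 2 for a, b in zip(v, u)))

def upper_bound_26(uv, uw, dvu, dwu):
    """Right side of Property 2.6 from four previously computed scalars."""
    return uv * uw + dvu * dwu

rng = random.Random(2)                     # illustrative seed
min_slack = float("inf")
for _ in range(1000):
    n = rng.choice((3, 10, 100))           # assumed test dimensions
    U, V, W = (unit([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(3))
    bound = upper_bound_26(dot(U, V), dot(U, W), d(V, U), d(W, U))
    min_slack = min(min_slack, bound - dot(V, W))   # should stay nonnegative
```

Equality is approached when the components of V and W orthogonal to U are parallel, which is the equality case of the Cauchy-Schwarz step in the proof.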


Property 2.7. $V^{*}W \le 1 - \frac{(U^{*}W - U^{*}V)^2}{2} - \frac{|d(V,U) - d(W,U)|^2}{2}$ 

Proof. From 

(6)   $V^{*}W = 1 - \frac{\|W - V\|^2}{2}$ 

and the orthogonal decomposition of W - V 

      $W - V = (W - T) - (V - P) + (U^{*}W - U^{*}V)\,U$ 

where P is given by (1) and T = (U*W)U. By Pythagoras’ theorem 

(7)   $\|W - V\|^2 = \|(W - T) - (V - P)\|^2 + (U^{*}W - U^{*}V)^2$ 

and for any orthogonal transformation H 

      $\|(W - T) - (V - P)\| \ge \big|\, \|H(W - T)\| - \|H(V - P)\| \,\big|$ 

and substituting in (7), we obtain 

(8)   $\|W - V\|^2 \ge \big|\, \|H(W - T)\| - \|H(V - P)\| \,\big|^2 + (U^{*}W - U^{*}V)^2$ 

Substituting (8) into (6) and using $d(V,U) = \|H(V - P)\|$ and $d(W,U) = \|H(W - T)\|$ yields the result. ■ 
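
Since d^2(V,U) = 1 - (U*V)^2 and d^2(W,U) = 1 - (U*W)^2, expanding the squares shows that the right sides of Properties 2.6 and 2.7 coincide. The sketch below checks this numerically; the function names and the sampled values are assumptions for illustration.

```python
import math
import random

def rhs_26(uv, uw, dvu, dwu):
    """Right side of Property 2.6: (U*V)(U*W) + d(V,U) d(W,U)."""
    return uv * uw + dvu * dwu

def rhs_27(uv, uw, dvu, dwu):
    """Right side of Property 2.7: 1 - (U*W - U*V)^2/2 - (d(V,U) - d(W,U))^2/2."""
    return 1.0 - 0.5 * (uw - uv) ** 2 - 0.5 * (dvu - dwu) ** 2

rng = random.Random(6)                      # illustrative seed
max_diff = 0.0
for _ in range(1000):
    uv = rng.uniform(-1.0, 1.0)             # plays the role of U*V
    uw = rng.uniform(-1.0, 1.0)             # plays the role of U*W
    dvu = math.sqrt(1.0 - uv * uv)          # d(V,U) from Property 2.1
    dwu = math.sqrt(1.0 - uw * uw)          # d(W,U) from Property 2.1
    diff = abs(rhs_26(uv, uw, dvu, dwu) - rhs_27(uv, uw, dvu, dwu))
    max_diff = max(max_diff, diff)

# Equal orthogonal lengths, opposite projections: U*V = 0.8 = -U*W.
example = rhs_27(0.8, -0.8, 0.6, 0.6)       # 1 - 2 * 0.8**2 = -0.28
```

In practice either form can be used; the form of Property 2.7 makes the separate contributions of the projections and of the orthogonal lengths visible.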

Clearly, since there is a whole hyperplane orthogonal to U, Properties 2.4 and 2.6-2.7 won’t provide sharp 
or conclusive bounds in cases in which the orthogonal lengths dominate and their difference is not large 
enough. However, Properties 2.4 and 2.6-2.7 can be complementary in other cases. Property 2.4 is 
inconclusive when the right side is small, i.e., $|d(V,U) - d(W,U)| = \varepsilon$ with $\varepsilon < \sqrt{1 - \delta^2}$, since it only 
says $d(V,W) \ge \varepsilon$; however, Property 2.7 yields 

      $V^{*}W \le 1 - \frac{(U^{*}W - U^{*}V)^2}{2} - \frac{\varepsilon^2}{2}$ 

and for $|U^{*}W - U^{*}V|$ sufficiently large we obtain $V^{*}W < \delta$, i.e., W and V are not in the same cluster. 
For example, if the components of V and W orthogonal to U are equal (i.e., $\varepsilon = 0$), and the components of 
V and W in the direction of U are of equal length but opposite sign (i.e., $U^{*}V = -U^{*}W$) and 
$\frac{1 - \delta}{2} < (U^{*}V)^2$, we obtain $V^{*}W \le 1 - 2(U^{*}V)^2 < \delta$. Conversely, Property 2.5 may hold while Property 2.7 is 
inconclusive when $|U^{*}V - U^{*}W| \le 1 - \delta$. In this case, from 2.7 it can’t be concluded that $V^{*}W < \delta$ since 
$1 + \delta^2 - |U^{*}V - U^{*}W|^2 \ge 2\delta$, and the right side of 2.7 is less than or equal to 

      $\frac{1 + \delta^2}{2} - \frac{|U^{*}V - U^{*}W|^2}{2}$ 
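
To illustrate how these bounds might reduce the number of inner products, the sketch below uses one reference vector U (a "pivot", a name introduced here) and the bound V*W ≤ (U*V)(U*W) + d(V,U) d(W,U) of Property 2.6 to rule out pairs that cannot reach the threshold δ. The synthetic two-cluster data, the threshold and the tolerance are assumptions for the example; the brute-force dot product is computed only to confirm that no excluded pair actually reaches δ.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def unit(x):
    n = math.sqrt(dot(x, x))
    return [a / n for a in x]

def make_point(rng, center, n, noise):
    """Unit vector near the coordinate axis e_center (synthetic test data)."""
    x = [rng.gauss(0.0, noise) for _ in range(n)]
    x[center] += 1.0
    return unit(x)

rng = random.Random(3)                       # illustrative seed
n, delta, tol = 20, 0.8, 1e-9                # assumed dimension and threshold
data = [make_point(rng, i % 2, n, 0.05) for i in range(40)]

pivot = data[0]
sims = [dot(pivot, x) for x in data]         # N inner products with the pivot
diss = [math.sqrt(max(0.0, 1.0 - s * s)) for s in sims]

pruned, false_prunes, total = 0, 0, 0
for i in range(len(data)):
    for j in range(i + 1, len(data)):
        total += 1
        bound = sims[i] * sims[j] + diss[i] * diss[j]
        if bound < delta - tol:              # the pair cannot satisfy V*W >= delta
            pruned += 1
            if dot(data[i], data[j]) >= delta:   # verification only
                false_prunes += 1
```

With well separated clusters the pairs that straddle the two groups are excluded without computing their inner products; pairs whose bound exceeds δ remain undecided and would need a direct computation or another reference vector.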

3. The Infinite Dimensional Case. The properties of section 2 also hold in a general inner-product 
space H. The proofs share the ideas of the finite dimensional case but require a different 
formalism, which is briefly presented in this section. Given a unit vector U we define a map 
$\Phi_U : H \to H$ by 

      $\Phi_U(V) = V - \langle U,V\rangle U$   for any V in H 

The dissimilarity measure d(V,U) between any two normalized vectors U and V is now defined by 

      $d(V,U) = \|\Phi_U(V)\|$ 

Notice that $\Phi_U(V)$ is the orthogonal component of V with respect to U, since for any V in H 

(9)   $V = \langle U,V\rangle U + \Phi_U(V)$ 

and for any $\alpha$ 

(10)  $\langle \alpha U, \Phi_U(V)\rangle = \alpha\,\langle U, V - \langle U,V\rangle U\rangle = 0$ 

consequently, for any $\beta$ 

(11)  $\|V - \beta U\|^2 = \|\Phi_U(V) + (\langle U,V\rangle - \beta)U\|^2 = \|\Phi_U(V)\|^2 + |\langle U,V\rangle - \beta|^2$ 

The map $\Phi_U$ has similar properties to the matrix (I - UU*) for the finite dimensional case. 

Property 3.1. $\Phi_U$ is a self-adjoint projection. 

Proof. For any V and W in H, by (9) and (10) 

      $\langle W, \Phi_U(V)\rangle = \langle \Phi_U(W), \Phi_U(V)\rangle + \langle \Phi_U(W), \langle U,V\rangle U\rangle = \langle \Phi_U(W), V\rangle$ 

      $\Phi_U^2(V) = \Phi_U(V) - \langle U, \Phi_U(V)\rangle U = \Phi_U(V)$. ■ 

The dissimilarity measure has the same properties as in the finite dimensional case. In effect, for any unit 
vectors U, V and W, we have 

Property 3.2. d(V,U) = d(U,V) 

Proof. Using Property 3.1 and $\langle U,U\rangle = \langle V,V\rangle = 1$, squaring the dissimilarity measures we obtain: 

      $d^2(V,U) = \langle \Phi_U(V), \Phi_U(V)\rangle = \langle \Phi_U(V), V\rangle = 1 - \langle U,V\rangle^2$ 
      $d^2(U,V) = \langle \Phi_V(U), \Phi_V(U)\rangle = \langle \Phi_V(U), U\rangle = 1 - \langle V,U\rangle^2$ ■ 

Property 3.3. $d(V,W) \le d(V,U) + d(U,W)$ 

Proof. Let $P = \langle U,V\rangle U$ be the projection of V in the direction of U, then from (9) $V - P = \Phi_U(V)$, 
and 

(12)  $d(V,U) = \|V - P\|$ 

Let $Q = \langle W,P\rangle W$ be the projection of P in the direction of W, then from (9) 

(13)  $\|P - Q\| = |\langle U,V\rangle|\, \|U - \langle W,U\rangle W\| \le \|\Phi_W(U)\| = d(U,W)$ 

Finally, let $R = \langle W,V\rangle W$ be the projection of V in the direction of W, then from (9), (11), (12) and 
(13) 

      $d(V,W) = \|V - R\| \le \|V - Q\| \le \|V - P\| + \|P - Q\| \le d(V,U) + d(U,W)$ ■ 

Substituting dot product notation by inner product notation, it is straightforward to verify that the rest of 
the properties in section 2 also hold in a general inner-product space H. 
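
As a rough illustration of the inner-product formulation, the sketch below replaces the dot product with a discretized Sobolev-type (H^1) inner product for functions sampled on a uniform grid of (0,1). The grid size, the finite-difference discretization and the three sample functions are assumptions chosen for the example; they are not taken from the paper.

```python
import math

N = 200                     # number of grid intervals (assumption)
h = 1.0 / N
xs = [i * h for i in range(N + 1)]

def inner(u, v):
    """Discrete H^1(0,1)-type inner product: values plus first differences."""
    l2 = h * sum(a * b for a, b in zip(u, v))
    du = [(u[i + 1] - u[i]) / h for i in range(N)]
    dv = [(v[i + 1] - v[i]) / h for i in range(N)]
    return l2 + h * sum(a * b for a, b in zip(du, dv))

def normalize(u):
    n = math.sqrt(inner(u, u))
    return [a / n for a in u]

def phi(u, v):
    """Phi_U(V) = V - <U,V> U for a unit vector U."""
    c = inner(u, v)
    return [b - c * a for a, b in zip(u, v)]

def d(v, u):
    """d(V,U) = ||Phi_U(V)|| in the norm induced by `inner`."""
    r = phi(u, v)
    return math.sqrt(inner(r, r))

# Three sample "data points": grid functions normalized in the discrete norm.
U = normalize([math.sin(math.pi * x) for x in xs])
V = normalize([math.sin(math.pi * x) + 0.3 * math.sin(2.0 * math.pi * x) for x in xs])
W = normalize([x * (1.0 - x) for x in xs])

sym_gap = abs(d(V, U) - d(U, V))                                   # Property 3.2
closed_gap = abs(d(V, U) - math.sqrt(1.0 - inner(U, V) ** 2))      # closed form
triangle_slack = d(V, U) + d(U, W) - d(V, W)                       # Property 3.3
```

Only the function `inner` depends on the chosen space, so the same code applies to any other inner product that can be evaluated on the data.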

4. Error Estimates for the Triangle Inequality. The error introduced in approximating d(W,X) 
by d(W,Y) + d(Y,X), W ≠ X, is discussed in this section. The first result obtained is confirmation of the 
intuitive idea that the “raise” of Y, i.e., the distance of Y to span(W,X), is one of the two added 
independent components of the error. Then, estimates for the other component of the error are obtained by 
considering Y in span(W,X). 

Let Y' be the projection of Y onto span(W,X), i.e., $\langle Y - Y', W\rangle = 0$ and $\langle Y - Y', X\rangle = 0$, then 

      $d^2(W,Y) = \|Y - \langle Y,W\rangle W\|^2 = \|Y - Y'\|^2 + \|Y' - \langle Y',W\rangle W\|^2 = \|Y - Y'\|^2 + \|Y'\|^2\, d^2(W,Y^{*})$ 
      $d^2(Y,X) = \|Y - \langle Y,X\rangle X\|^2 = \|Y - Y'\|^2 + \|Y' - \langle Y',X\rangle X\|^2 = \|Y - Y'\|^2 + \|Y'\|^2\, d^2(Y^{*},X)$ 

where $Y^{*} = Y'/\|Y'\|$, and since $\|Y'\|^2 = 1 - \|Y - Y'\|^2$, it follows that 

      $d^2(Y,X) = \|Y - Y'\|^2\,(1 - d^2(Y^{*},X)) + d^2(Y^{*},X)$ 
      $d^2(W,Y) = \|Y - Y'\|^2\,(1 - d^2(W,Y^{*})) + d^2(W,Y^{*})$ 

which shows that 

      $d(W,Y) + d(Y,X) \ge d(W,Y^{*}) + d(Y^{*},X)$ 

Consequently 

      $\inf_{Y} \{ d(W,Y) + d(Y,X) \} = \inf_{Y \in \mathrm{span}(W,X)} \{ d(W,Y) + d(Y,X) \}$ 

and to find error bounds and estimates, the above justifies focusing on 

(14)  $Y = \alpha W + \beta X, \quad \|Y\| = 1$ 

From the definition of d, using (14) and the fact that X and W are unit length, it follows that 

(15)  $d(W,Y) + d(Y,X) = \|\alpha W + \beta X - \langle \alpha W + \beta X, W\rangle W\| + \|\alpha W + \beta X - \langle \alpha W + \beta X, X\rangle X\|$ 
      $= |\beta|\, \|X - \langle X,W\rangle W\| + |\alpha|\, \|W - \langle W,X\rangle X\|$ 
      $= (|\alpha| + |\beta|)\, d(W,X)$ 

In summary, the fundamental result of this section obtained from the triangle inequality and (15), can be 
stated as 

Property 4.1. For Y in span(W,X) 

(16)  $d(W,X) \le d(W,Y) + d(Y,X) = (|\alpha| + |\beta|)\, d(W,X)$ ■ 
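
Identity (15) and the effect of moving Y out of span(W,X) can be checked numerically. In the sketch below the coefficients `a0`, `b0`, the dimension and the lift size 0.5 are arbitrary assumptions; `Y` lies in span(W,X) and `Y_out` is obtained by adding a direction orthogonal to both W and X and renormalizing, so that Y is the normalized projection of `Y_out` onto the span.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def unit(x):
    n = norm(x)
    return [a / n for a in x]

def orth(v, u):
    """Component of v orthogonal to the unit vector u."""
    c = dot(u, v)
    return [a - c * b for a, b in zip(v, u)]

def d(v, u):
    return norm(orth(v, u))

rng = random.Random(4)                         # illustrative seed
n = 8
W = unit([rng.gauss(0.0, 1.0) for _ in range(n)])
X = unit([rng.gauss(0.0, 1.0) for _ in range(n)])

a0, b0 = 0.7, -1.3                             # arbitrary coefficients (assumption)
Z = [a0 * w + b0 * x for w, x in zip(W, X)]
nz = norm(Z)
alpha, beta = a0 / nz, b0 / nz                 # Y = alpha*W + beta*X has unit length
Y = [z / nz for z in Z]

# Identity (15): d(W,Y) + d(Y,X) = (|alpha| + |beta|) d(W,X).
gap_15 = abs(d(W, Y) + d(Y, X) - (abs(alpha) + abs(beta)) * d(W, X))

# Lift Y out of span(W, X) along a unit direction E orthogonal to W and X.
X2 = unit(orth(X, W))                          # W, X2: orthonormal basis of the span
E = unit(orth(orth([rng.gauss(0.0, 1.0) for _ in range(n)], W), X2))
Y_out = unit([y + 0.5 * e for y, e in zip(Y, E)])
lift_slack = d(W, Y_out) + d(Y_out, X) - (d(W, Y) + d(Y, X))
```

A positive `lift_slack` agrees with the statement that the distance of Y to span(W,X) adds to the error of the triangle inequality estimate.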

Motivated by Property 4.1, the rest of this section concentrates on obtaining bounds for $|\alpha| + |\beta|$. From 
(14) 

(17)  $1 = \alpha^2 + \beta^2 + 2\alpha\beta\,\langle W,X\rangle$ 

and it follows from (17) that 

(18)  $(|\alpha| + |\beta|)^2 = 1 + 2|\alpha||\beta| - 2\alpha\beta\,\langle W,X\rangle$ 

which shows that $|\alpha| + |\beta| \ge 1$, and that for the Y that yields a minimum of $|\alpha| + |\beta|$ it is necessary that 

(19)  $\alpha\beta\,\langle W,X\rangle \ge 0$. 

Consequently, (18) can be rewritten 

(20)  $|\alpha| + |\beta| = \sqrt{1 + 2|\alpha||\beta|\,(1 - |\langle W,X\rangle|)}$ 

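On the other hand, for any constant c, 0 < c < 1, from (17) it follows 

      $\alpha^2 \le \{1 - 2\alpha\beta\langle W,X\rangle\} - c\beta^2$   and   $\beta^2 \le \{1 - 2\alpha\beta\langle W,X\rangle\} - c\alpha^2$ 

multiplying 

      $\alpha^2\beta^2 \le \{1 - 2\alpha\beta\langle W,X\rangle\}\,[\,(1 - c + c)\{1 - 2\alpha\beta\langle W,X\rangle\} - c\alpha^2 - c\beta^2\,] + c^2\alpha^2\beta^2$ 

and using (17) 

      $(1 - c^2)\,\alpha^2\beta^2 \le (1 - c)\,\{1 - 2\alpha\beta\langle W,X\rangle\}^2$ 

which yields 

      $|\alpha||\beta| \le \frac{1}{\sqrt{1 + c}}\,\{1 - 2\alpha\beta\langle W,X\rangle\}$ 

defining $k = \frac{1}{\sqrt{1 + c}}$, and using (19) to introduce absolute values 

      $\{1 + 2k\,|\langle W,X\rangle|\}\,|\alpha||\beta| \le k$ 

which gives 

      $|\alpha||\beta| \le \frac{1}{\sqrt{1 + c} + 2|\langle W,X\rangle|}$ 

and taking the limit as $c \to 1$ 

(21)  $|\alpha||\beta| \le \frac{1}{\sqrt{2} + 2|\langle W,X\rangle|}$ 

Finally substituting (21) in (20), the following bound is obtained 

Property 4.2. For any Y in span(W,X) satisfying (14) and (19) 

      $|\alpha| + |\beta| \le \sqrt{1 + \frac{2}{\sqrt{2} + 2|\langle W,X\rangle|}\,(1 - |\langle W,X\rangle|)}$. 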


Table 1 was obtained from Property 4.2, and lists bounds for $|\alpha| + |\beta|$ for some standard values of 
$|\langle W,X\rangle|$ (or equivalently, cosine of angles between W and X): 

Table 1 

For $|\langle W,X\rangle| \ge$     $1/2$                      $\sqrt{2}/2$                          $\sqrt{3}/2$ 
$|\alpha| + |\beta| \le$           $\sqrt[4]{2} = 1.1892$     $\sqrt{(1 + \sqrt{2})/2} = 1.09868$   $\sqrt{1 + (2 - \sqrt{3})(\sqrt{3} - \sqrt{2})} = 1.0417$ 
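
The entries of Table 1 can be recomputed from the bound |α| + |β| ≤ sqrt(1 + 2(1 - c)/(√2 + 2c)), with c = |⟨W,X⟩|, and the bound can be compared with |α| + |β| for sampled coefficients. The sketch below works with scalars only; the sampling scheme, which enforces the normalization (17) and the sign condition (19), is an assumption made for the check.

```python
import math
import random

def bound_42(c):
    """Right side of Property 4.2 as a function of c = |<W,X>|."""
    return math.sqrt(1.0 + 2.0 * (1.0 - c) / (math.sqrt(2.0) + 2.0 * c))

table_1 = {
    "1/2": bound_42(0.5),
    "sqrt(2)/2": bound_42(math.sqrt(2.0) / 2.0),
    "sqrt(3)/2": bound_42(math.sqrt(3.0) / 2.0),
}

rng = random.Random(5)                        # illustrative seed
worst = float("inf")
for _ in range(5000):
    c = rng.uniform(-1.0, 1.0)                # plays the role of <W,X>
    t = rng.uniform(0.0, math.pi / 2.0)
    a, b = math.cos(t), math.sin(t)
    if c < 0.0:
        b = -b                                # enforce (19): a*b*c >= 0
    scale = math.sqrt(a * a + b * b + 2.0 * a * b * c)
    alpha, beta = a / scale, b / scale        # now (17) holds
    worst = min(worst, bound_42(abs(c)) - (abs(alpha) + abs(beta)))
```

In this sample the bound is never violated (`worst` stays nonnegative up to rounding), and the three tabulated values are reproduced to the printed precision.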


5. Optimal Error Estimates for the Triangle Inequality. In this section conditions to minimize 
d(W,Y) + d(Y,X) are sought. First the optimal directions are determined and then the error for those 
directions is found. The arguments given in section 4 show that it is sufficient to consider the set of Y’s 
satisfying (14), and according to (16) the optimum is found minimizing $|\alpha| + |\beta|$. For convenience Y is 
written 

(22)  $Y = \frac{aW + bX}{\|aW + bX\|} = \alpha W + \beta X$ 

so that $|\alpha| + |\beta| = f(a,b)$ becomes a function of a and b given by 

(23)  $f(a,b) = \frac{|a| + |b|}{\sqrt{a^2 + 2ab\,\langle W,X\rangle + b^2}}$ 

A standard gradient calculation shows that $\frac{\partial f}{\partial a} = 0$ if and only if 

(24)  $\mathrm{sign}(a)\,(a^2 + 2ab\,\langle W,X\rangle + b^2) - (|a| + |b|)\,(a + b\,\langle W,X\rangle) = 0$ 

and (24) reduces to 

(25)  $(|a| - |b|)\,|b|\,(\mathrm{sign}(b)\,\langle W,X\rangle - \mathrm{sign}(a)) = 0$ 

Equation (25) holds for 

(26)  $|a| = |b|$ 

or for $\langle W,X\rangle = \mathrm{sign}(a)/\mathrm{sign}(b)$, which is the trivial case where W is co-linear with X. Similar results are 
obtained from $\frac{\partial f}{\partial b} = 0$. From (19) and (26) it follows that 

(27)  $a = b$    for $\langle W,X\rangle > 0$ 

(28)  $a = -b$   for $\langle W,X\rangle < 0$ 

Substituting in (23), both (27) and (28) yield 

(29)  $f(a,b) = \sqrt{\frac{2}{1 + |\langle W,X\rangle|}}$ 

Table 2 was obtained from (29) and lists optimal bounds for $|\alpha| + |\beta|$ for some standard values of 
$|\langle W,X\rangle|$ (or equivalently, cosine of angles between W and X): 

Table 2 

For $|\langle W,X\rangle| \ge$     $1/2$                     $\sqrt{2}/2$                               $\sqrt{3}/2$ 
$|\alpha| + |\beta| \le$           $2/\sqrt{3} = 1.1547$     $\sqrt{2}\,\sqrt{2 - \sqrt{2}} = 1.08239$   $2/\sqrt{2 + \sqrt{3}} = 1.03528$ 
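
The values in Table 2 follow from evaluating f(a,b) = (|a| + |b|)/sqrt(a^2 + 2ab⟨W,X⟩ + b^2) at |a| = |b|. The sketch below recomputes them and scans the ratio b/a on a logarithmic grid for ⟨W,X⟩ > 0; the grid and the function names are assumptions. In this scan the point a = b gives the largest value of f among the sampled positive ratios, which matches reading the table entries as upper bounds for |α| + |β|.

```python
import math

def f(a, b, c):
    """f(a,b) = (|a| + |b|) / sqrt(a^2 + 2ab c + b^2), with c = <W,X>."""
    return (abs(a) + abs(b)) / math.sqrt(a * a + 2.0 * a * b * c + b * b)

def f_opt(c):
    """Value of f at |a| = |b| with the sign of a*b equal to the sign of c."""
    return math.sqrt(2.0 / (1.0 + abs(c)))

cosines = (0.5, math.sqrt(2.0) / 2.0, math.sqrt(3.0) / 2.0)
table_2 = [f_opt(c) for c in cosines]

# Ratios b/a > 0 from 0.01 to 100; a*b*c >= 0 then holds for c > 0.
ratios = [10.0 ** (k / 50.0) for k in range(-100, 101)]
scan_max = [max(f(1.0, r, c) for r in ratios) for c in cosines]

# f evaluated at a = b should agree with the closed form.
stationary_gap = max(abs(f(1.0, 1.0, c) - f_opt(c)) for c in cosines)

# For c < 0 the corresponding choice is a = -b.
negative_case_gap = abs(f(1.0, -1.0, -0.5) - f_opt(-0.5))
```

At the ends of the scan (b/a very small or very large) f approaches 1, the value obtained when Y coincides with W or X.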


Table 3 

$\langle W,X\rangle$      $0$           $1/2$           $\sqrt{2}/2$                       $\sqrt{3}/2$ 
$d(W,X)$                  $1$           $\sqrt{3}/2$    $\sqrt{2}/2$                       $1/2$ 
$d(W,Y) + d(Y,X)$         $\sqrt{2}$    $1$             $\sqrt{2 - \sqrt{2}} = 0.765367$   $\sqrt{2 - \sqrt{3}} = 0.517638$ 

Table 3 illustrates the estimates obtained with the triangle inequality for the optimal case (27) and some 
standard values of $\langle W,X\rangle$. 
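
The rows of Table 3 can be reproduced by constructing W, X and Y explicitly. The sketch below is an illustration under the assumption that W and X are unit vectors in the plane with ⟨W,X⟩ = c ≥ 0 and that Y is the normalized sum W + X, i.e. the case a = b; the helper name `table_3_row` is introduced here.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def d(v, u):
    """Length of the component of v orthogonal to the unit vector u."""
    c = dot(u, v)
    return math.sqrt(sum((a - c * b) ** 2 for a, b in zip(v, u)))

def table_3_row(c):
    """Return (d(W,X), d(W,Y) + d(Y,X)) for <W,X> = c >= 0 and Y along W + X."""
    W = [1.0, 0.0]
    X = [c, math.sqrt(1.0 - c * c)]            # unit vector with <W,X> = c
    S = [w + x for w, x in zip(W, X)]          # a = b in Y = (aW + bX)/||aW + bX||
    n = math.sqrt(dot(S, S))
    Y = [s / n for s in S]
    return d(W, X), d(W, Y) + d(Y, X)

cosines = (0.0, 0.5, math.sqrt(2.0) / 2.0, math.sqrt(3.0) / 2.0)
rows = [table_3_row(c) for c in cosines]
```

Each pair in `rows` holds d(W,X) followed by d(W,Y) + d(Y,X) for the corresponding cosine, so the ratio of the two entries is the factor |α| + |β| of Property 4.1.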